Linear Regression & Friends (OLS, Ridge, Lasso, Elastic Net)#

Linear regression is the hello world of predictive modeling: it’s simple enough to understand end-to-end, but rich enough to teach you the habits you’ll reuse everywhere:

  • how to translate a question into a loss function

  • how to solve an optimization problem (closed-form vs iterative)

  • how to diagnose overfitting

  • how regularization changes a model’s behavior

This notebook builds intuition first (with analogies), then stays honest to the math.


Learning goals#

By the end you should be able to:

  • fit simple and multiple linear regression

  • compute coefficients (“betas”) via:

    • closed-form normal equation

    • Cholesky factorization (linear algebra)

    • gradient descent (optimization)

  • understand and fit Ridge (L2), Lasso (L1) and Elastic Net (L1+L2)

  • use scikit-learn versions and interpret the key parameters

Notation (quick)#

  • Targets: \(y \in \mathbb{R}^n\)

  • Features: \(X \in \mathbb{R}^{n\times d}\) (each row is a sample)

  • Coefficients: \(\beta \in \mathbb{R}^d\)

  • Intercept: \(b \in \mathbb{R}\) (sometimes written \(\beta_0\))

  • Prediction: \(\hat{y} = b + X\beta\)
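The notation maps directly onto numpy arrays. A quick shape sanity check (standalone sketch with made-up dimensions, not used by later cells):

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 5, 3                  # 5 samples, 3 features
X = rng.normal(size=(n, d))  # X in R^{n x d}, each row a sample
beta = rng.normal(size=d)    # beta in R^d
b = 0.5                      # intercept

y_hat = b + X @ beta         # prediction: one value per sample
print(y_hat.shape)           # (5,)
```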


Table of contents#

  1. The story: one dial vs many dials

  2. Simple linear regression (one feature)

  3. Multiple linear regression (many features)

  4. Getting betas

    • closed-form / normal equation

    • Cholesky solve

    • gradient descent

  5. Regularization

    • Ridge (L2)

    • Lasso (L1)

    • Elastic Net (L1 + L2)

  6. scikit-learn equivalents + parameter intuition

  7. Practical checklist + pitfalls

import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio

from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet, SGDRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
np.set_printoptions(precision=4, suppress=True)

rng = np.random.default_rng(42)

1) The story: one dial vs many dials#

Imagine you’re predicting someone’s commute time.

  • With one feature (say, distance), linear regression is like a single dial: turn the slope up/down until the line fits.

  • With many features (distance, traffic, weather, time of day), it becomes a mixing board with many sliders.

OLS (ordinary least squares) says:

“Pick the dial settings (betas) that make the squared mistakes as small as possible.”

That’s it. Everything else in this notebook is different ways to do that—and to prevent the model from getting too clever.

2) Simple linear regression (one feature)#

Model#

With one feature \(x\), we predict:

\[ \hat{y} = \beta_0 + \beta_1 x \]
  • \(\beta_0\) is the intercept (where the line crosses the y-axis)

  • \(\beta_1\) is the slope (how much \(y\) changes when \(x\) increases by 1)

Loss (OLS)#

We choose betas to minimize the mean squared error (MSE):

\[ J(\beta_0, \beta_1) = \frac{1}{n} \sum_{i=1}^n (\beta_0 + \beta_1 x_i - y_i)^2 \]
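Spelled out in code, the loss is just a mean of squared residuals. A standalone sketch with tiny made-up numbers (the cells below don't use this helper):

```python
import numpy as np

def mse_loss(beta0: float, beta1: float, x: np.ndarray, y: np.ndarray) -> float:
    """J(beta0, beta1): mean squared error of the line beta0 + beta1 * x."""
    return float(np.mean((beta0 + beta1 * x - y) ** 2))

x = np.array([0.0, 1.0, 2.0])
y = np.array([1.0, 3.0, 5.0])    # exactly y = 1 + 2x

print(mse_loss(1.0, 2.0, x, y))  # 0.0 -- perfect fit
print(mse_loss(0.0, 2.0, x, y))  # 1.0 -- every residual is -1
```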
# Synthetic "one-feature" dataset
n_samples_simple = 120
x_simple = rng.uniform(0, 10, size=n_samples_simple)

beta0_true = 3.0
beta1_true = 2.0
noise = rng.normal(0, 2.0, size=n_samples_simple)

y_simple = beta0_true + beta1_true * x_simple + noise

fig = px.scatter(
    x=x_simple,
    y=y_simple,
    title="Synthetic data: one feature",
    labels={"x": "x", "y": "y"},
)
fig.show()

Closed-form for the one-feature case#

When there’s only one feature, OLS has a compact solution.

Let \(\bar{x}\) and \(\bar{y}\) be the means. Then:

\[ \beta_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2} \qquad \beta_0 = \bar{y} - \beta_1 \bar{x} \]

Interpretation:

  • numerator = how \(x\) and \(y\) “move together” (covariance)

  • denominator = how much \(x\) varies

If \(x\) doesn’t vary at all, you can’t learn a slope.

# Closed-form simple regression (one feature)
x_mean = x_simple.mean()
y_mean = y_simple.mean()

beta1_hat = np.sum((x_simple - x_mean) * (y_simple - y_mean)) / np.sum((x_simple - x_mean) ** 2)
beta0_hat = y_mean - beta1_hat * x_mean

beta0_hat, beta1_hat
(2.996017195166255, 1.9971905099769254)
# Plot the fitted line
x_line = np.linspace(x_simple.min(), x_simple.max(), 200)
y_line = beta0_hat + beta1_hat * x_line

fig = go.Figure()
fig.add_trace(go.Scatter(x=x_simple, y=y_simple, mode="markers", name="data"))
fig.add_trace(go.Scatter(x=x_line, y=y_line, mode="lines", name="OLS fit"))
fig.update_layout(
    title="Simple linear regression: OLS line",
    xaxis_title="x",
    yaxis_title="y",
)
fig.show()

Residuals: “what the story doesn’t explain”#

Residuals are \(r_i = y_i - \hat{y}_i\).

A friendly mental model: if the model is a story about the data, residuals are the parts the story can’t explain.

y_pred_simple = beta0_hat + beta1_hat * x_simple
residuals = y_simple - y_pred_simple

fig = px.histogram(residuals, nbins=30, title="Residual distribution (simple regression)")
fig.update_layout(xaxis_title="residual", yaxis_title="count")
fig.show()

3) Multiple linear regression (many features)#

With multiple features, we write predictions compactly.

Create a design matrix with an intercept column:

\[\begin{split} X = \begin{bmatrix} 1 & x_{11} & x_{12} & \dots & x_{1d} \\ 1 & x_{21} & x_{22} & \dots & x_{2d} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \dots & x_{nd} \end{bmatrix} \quad \beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_d \end{bmatrix} \end{split}\]

Then:

\[ \hat{y} = X\beta \]

Same idea, more dials.
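Building the intercept-augmented design matrix is one line of numpy. A minimal sketch with a tiny hand-written matrix:

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])  # n=3 samples, d=2 features

# Prepend a column of ones so beta_0 plays the role of the intercept
X_design = np.column_stack([np.ones(X.shape[0]), X])
print(X_design)
# [[1. 1. 2.]
#  [1. 3. 4.]
#  [1. 5. 6.]]
```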

Why multiple regression gets tricky (multicollinearity)#

If two features carry almost the same information (e.g., \(x_2 \approx x_1\)), OLS can still fit well, but the individual coefficients can become unstable.

Analogy: two coworkers both trying to take credit for the same project.

We’ll build a dataset with correlated features to make this visible.

# Synthetic multi-feature dataset with correlated features
n_samples_multi = 300
x1 = rng.normal(0, 1, size=n_samples_multi)

# x2 is highly correlated with x1
x2 = 0.85 * x1 + rng.normal(0, 0.25, size=n_samples_multi)

# x3 is mostly independent
x3 = rng.normal(0, 1, size=n_samples_multi)

X_raw = np.column_stack([x1, x2, x3])
feature_names = ["x1 (signal)", "x2 (correlated)", "x3 (signal)"]

true_intercept = 1.5
true_beta = np.array([2.0, -1.5, 0.7])

y_multi = true_intercept + X_raw @ true_beta + rng.normal(0, 1.0, size=n_samples_multi)

corr = np.corrcoef(X_raw, rowvar=False)
fig = px.imshow(corr, x=feature_names, y=feature_names, text_auto=True, title="Feature correlation")
fig.show()
X_train, X_test, y_train, y_test = train_test_split(X_raw, y_multi, test_size=0.25, random_state=42)
X_train.shape, X_test.shape
((225, 3), (75, 3))

4) Getting betas (three ways)#

4.1 Closed-form (normal equation)#

OLS minimizes:

\[ J(\beta) = \|X\beta - y\|_2^2 \]

Take derivatives, set to zero → the normal equation:

\[ X^\top X\,\beta = X^\top y \]

If \(X^\top X\) is invertible:

\[ \beta = (X^\top X)^{-1} X^\top y \]

In practice: avoid explicit inverses. Use solve, lstsq, or factorizations.

def add_intercept_column(X: np.ndarray) -> np.ndarray:
    return np.column_stack([np.ones(X.shape[0]), X])


def ols_via_lstsq(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    # Solves min ||Xb - y||_2 using a stable method (SVD under the hood)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta


def ols_via_normal_equation_solve(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    # Solves (X^T X) b = X^T y without forming an explicit inverse
    XtX = X.T @ X
    Xty = X.T @ y
    return np.linalg.solve(XtX, Xty)


def ols_via_cholesky(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    # If XtX is SPD, we can solve using a Cholesky factorization: XtX = L L^T
    XtX = X.T @ X
    Xty = X.T @ y

    L = np.linalg.cholesky(XtX)
    z = np.linalg.solve(L, Xty)
    beta = np.linalg.solve(L.T, z)
    return beta


X_design = add_intercept_column(X_train)

beta_lstsq = ols_via_lstsq(X_design, y_train)
beta_solve = ols_via_normal_equation_solve(X_design, y_train)
beta_chol = ols_via_cholesky(X_design, y_train)

beta_lstsq, beta_solve, beta_chol
(array([ 1.5167,  2.2547, -1.8184,  0.674 ]),
 array([ 1.5167,  2.2547, -1.8184,  0.674 ]),
 array([ 1.5167,  2.2547, -1.8184,  0.674 ]))
# Compare to the true parameters
beta_true = np.concatenate([[true_intercept], true_beta])

labels = ["intercept", "x1", "x2", "x3"]

fig = go.Figure()
fig.add_trace(go.Bar(name="true", x=labels, y=beta_true))
fig.add_trace(go.Bar(name="OLS (lstsq)", x=labels, y=beta_lstsq))
fig.add_trace(go.Bar(name="OLS (solve)", x=labels, y=beta_solve))
fig.add_trace(go.Bar(name="OLS (cholesky)", x=labels, y=beta_chol))
fig.update_layout(barmode="group", title="Coefficients: true vs OLS solutions")
fig.show()

4.2 Gradient descent (optimization)#

Closed-form is great, but:

  • it can be expensive when \(d\) is large

  • it doesn’t exist for Lasso

  • it’s not how many modern models are trained

Gradient descent treats the loss like a landscape and repeatedly takes steps “downhill”.

For MSE with an intercept column included, a common gradient form is:

\[ \nabla_\beta\, J(\beta) = \frac{2}{n} X^\top (X\beta - y) \]
def gradient_descent_linear_regression(
    X: np.ndarray,
    y: np.ndarray,
    learning_rate: float = 0.01,
    n_steps: int = 1000,
) -> tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray]:
    # Full-batch gradient descent on the MSE.
    # Assumes X already includes an intercept column.
    n_samples, n_features = X.shape
    beta = np.zeros(n_features)

    steps, losses, betas = [], [], []
    for step in range(n_steps):
        residuals = X @ beta - y
        gradient = (2.0 / n_samples) * (X.T @ residuals)
        beta = beta - learning_rate * gradient

        steps.append(step)
        losses.append(float(np.mean(residuals ** 2)))
        betas.append(beta.copy())

    return beta, np.array(steps), np.array(losses), np.array(betas)


# Demonstrate on the simple 1D dataset
X_simple = add_intercept_column(x_simple.reshape(-1, 1))

beta_gd, gd_steps, gd_losses, gd_betas = gradient_descent_linear_regression(
    X_simple,
    y_simple,
    learning_rate=0.01,
    n_steps=3000,
)

beta_gd
fig = px.line(x=gd_steps, y=gd_losses, title="Gradient descent: MSE over iterations")
fig.update_layout(xaxis_title="iteration", yaxis_title="MSE")
fig.show()
# Visualize the loss surface J(beta0, beta1) and the GD path
beta0_grid = np.linspace(beta0_hat - 6, beta0_hat + 6, 80)
beta1_grid = np.linspace(beta1_hat - 2.5, beta1_hat + 2.5, 80)

B0, B1 = np.meshgrid(beta0_grid, beta1_grid)

# Compute MSE on the grid
Y_pred_grid = B0[..., None] + B1[..., None] * x_simple[None, None, :]
MSE_grid = np.mean((Y_pred_grid - y_simple[None, None, :]) ** 2, axis=-1)

fig = go.Figure()
fig.add_trace(
    go.Contour(
        x=beta0_grid,
        y=beta1_grid,
        z=MSE_grid,
        contours_coloring="heatmap",
        showscale=True,
        name="MSE",
    )
)
fig.add_trace(
    go.Scatter(
        x=gd_betas[:, 0],
        y=gd_betas[:, 1],
        mode="lines+markers",
        name="GD path",
        line=dict(color="black"),
    )
)
fig.update_layout(
    title="Loss landscape (contours) + gradient descent path",
    xaxis_title="beta0",
    yaxis_title="beta1",
)
fig.show()

5) Regularization: Ridge, Lasso, Elastic Net#

When features are correlated or numerous, OLS can “spread credit” in unstable ways.

Regularization adds a preference:

  • Ridge (L2): “Prefer smaller coefficients.” (shrinks smoothly)

  • Lasso (L1): “Prefer fewer non-zero coefficients.” (can set some to zero)

  • Elastic Net: “A mix of both.”

Analogy:

  • Ridge is like a safety belt: you can still move, but extreme swings are damped.

  • Lasso is like a budget: if you pay for one coefficient, you have less to spend on others.

Important: regularization depends on feature scale → standardize your features.

Standardization helpers (from scratch)#

We’ll standardize using train statistics:

\[ X_{scaled} = \frac{X - \mu_{train}}{\sigma_{train}} \]

Then we fit on scaled features and convert back to original units.

def standardize_fit(X_train: np.ndarray) -> tuple[np.ndarray, np.ndarray, np.ndarray]:
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0, ddof=0)
    std = np.where(std == 0, 1.0, std)
    return mean, std, (X_train - mean) / std


def standardize_apply(X: np.ndarray, mean: np.ndarray, std: np.ndarray) -> np.ndarray:
    return (X - mean) / std


def unscale_coefficients(
    beta_scaled: np.ndarray,
    feature_mean: np.ndarray,
    feature_std: np.ndarray,
    target_mean: float,
) -> tuple[float, np.ndarray]:
    # If y_centered = y - target_mean and X_scaled = (X - mean)/std
    # then y = target_mean + X_scaled @ beta_scaled
    # and beta_original = beta_scaled / std
    beta_original = beta_scaled / feature_std
    intercept_original = target_mean - feature_mean @ beta_original
    return float(intercept_original), beta_original


def predict_linear(X: np.ndarray, intercept: float, beta: np.ndarray) -> np.ndarray:
    return intercept + X @ beta


feature_mean, feature_std, X_train_scaled = standardize_fit(X_train)
y_train_mean = y_train.mean()
y_train_centered = y_train - y_train_mean

X_test_scaled = standardize_apply(X_test, feature_mean, feature_std)

X_train_scaled.shape
(225, 3)

5.1 Ridge regression (L2)#

Ridge regression (as implemented in sklearn.linear_model.Ridge) minimizes:

\[ \|y - X\beta\|_2^2 + \alpha\|\beta\|_2^2 \]
  • \(\alpha \ge 0\) controls the strength (bigger → more shrinkage)

  • \(\alpha = 0\) reduces to OLS

The solution is still linear algebra—just with a “stabilized” matrix:

\[ (X^\top X + \alpha I)\,\beta = X^\top y \]

Practical note:

  • We typically do not penalize the intercept.

  • In this notebook we center \(y\) and standardize \(X\), then recover the intercept from the training mean.

def ridge_closed_form(X: np.ndarray, y: np.ndarray, alpha: float) -> np.ndarray:
    n_features = X.shape[1]
    XtX = X.T @ X
    Xty = X.T @ y
    return np.linalg.solve(XtX + alpha * np.eye(n_features), Xty)


alphas = np.logspace(-3, 2, 40)
ridge_coefs_scaled = []

for alpha in alphas:
    ridge_coefs_scaled.append(ridge_closed_form(X_train_scaled, y_train_centered, alpha))

ridge_coefs_scaled = np.vstack(ridge_coefs_scaled)

fig = go.Figure()
for j, name in enumerate(feature_names):
    fig.add_trace(go.Scatter(x=alphas, y=ridge_coefs_scaled[:, j], mode="lines", name=name))

fig.update_layout(
    title="Ridge coefficient paths (scaled features)",
    xaxis_title="alpha (log scale)",
    yaxis_title="coefficient (scaled)",
    xaxis_type="log",
)
fig.show()
# Evaluate ridge for a few alpha values (convert back to original units)

def fit_ridge_and_score(alpha: float) -> dict:
    beta_scaled = ridge_closed_form(X_train_scaled, y_train_centered, alpha)
    intercept, beta = unscale_coefficients(beta_scaled, feature_mean, feature_std, y_train_mean)
    y_pred = predict_linear(X_test, intercept, beta)
    return {
        "alpha": alpha,
        "mse": mean_squared_error(y_test, y_pred),
        "r2": r2_score(y_test, y_pred),
        "intercept": intercept,
        "beta": beta,
    }

[fit_ridge_and_score(a) for a in [0.0, 0.1, 1.0, 10.0]]
[{'alpha': 0.0,
  'mse': 1.221855599191807,
  'r2': 0.5613497744532487,
  'intercept': 1.5167030598770732,
  'beta': array([ 2.2547, -1.8184,  0.674 ])},
 {'alpha': 0.1,
  'mse': 1.2206829487952962,
  'r2': 0.5617707598473143,
  'intercept': 1.5172826696255273,
  'beta': array([ 2.2349, -1.7957,  0.6741])},
 {'alpha': 1.0,
  'mse': 1.2137240283725328,
  'r2': 0.564269035433256,
  'intercept': 1.5220139241124422,
  'beta': array([ 2.073 , -1.6111,  0.6742])},
 {'alpha': 10.0,
  'mse': 1.2517741006491223,
  'r2': 0.5506089328833015,
  'intercept': 1.5458280302704095,
  'beta': array([ 1.2685, -0.7048,  0.6596])}]

Ridge via Cholesky (linear algebra)#

Ridge requires solving a symmetric positive definite system (for \(\alpha > 0\)):

\[ (X^\top X + \alpha I)\,\beta = X^\top y \]

A classic approach is Cholesky factorization:

  • factor \(A = X^\top X + \alpha I\) as \(A = LL^\top\)

  • solve \(Lz = X^\top y\)

  • solve \(L^\top\beta = z\)

This avoids computing any inverse explicitly.

def ridge_via_cholesky(X: np.ndarray, y: np.ndarray, alpha: float) -> np.ndarray:
    n_features = X.shape[1]
    A = (X.T @ X) + alpha * np.eye(n_features)
    b = X.T @ y

    L = np.linalg.cholesky(A)
    z = np.linalg.solve(L, b)
    return np.linalg.solve(L.T, z)


alpha_demo = 1.0
beta_solve = ridge_closed_form(X_train_scaled, y_train_centered, alpha=alpha_demo)
beta_chol = ridge_via_cholesky(X_train_scaled, y_train_centered, alpha=alpha_demo)

np.max(np.abs(beta_solve - beta_chol))
2.220446049250313e-15

Choosing \(\alpha\): a quick validation curve#

In real projects you tune \(\alpha\) using a validation set or cross-validation.

Below is a simple validation curve for Ridge (using an sklearn pipeline so scaling is fit only on the training split).

X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=7)

alphas_ridge = np.logspace(-3, 3, 60)
val_mse = []

for a in alphas_ridge:
    model = make_pipeline(StandardScaler(), Ridge(alpha=a))
    model.fit(X_tr, y_tr)
    val_mse.append(mean_squared_error(y_val, model.predict(X_val)))

val_mse = np.array(val_mse)
best_alpha = float(alphas_ridge[np.argmin(val_mse)])

fig = px.line(x=alphas_ridge, y=val_mse, title=f"Ridge validation curve (best alpha ≈ {best_alpha:.4g})")
fig.update_layout(xaxis_title="alpha (log scale)", yaxis_title="validation MSE", xaxis_type="log")
fig.show()

best_alpha
0.001

5.2 Lasso regression (L1)#

Lasso replaces the L2 penalty with L1:

\[ \frac{1}{2n}\|X\beta - y\|_2^2 + \alpha\|\beta\|_1 \]

Key effect: L1 tends to create sparsity (some coefficients become exactly 0).

There is no closed-form solution in general.

A common solver is coordinate descent:

  • hold all coefficients fixed

  • update one coefficient at a time

  • repeat until convergence

The core operation becomes soft-thresholding:

\[ S(z, \gamma) = \mathrm{sign}(z)\,\max(|z| - \gamma, 0) \]
def soft_threshold(z: float, gamma: float) -> float:
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def lasso_coordinate_descent(
    X: np.ndarray,
    y: np.ndarray,
    alpha: float,
    n_iter: int = 5000,
    tol: float = 1e-7,
    warm_start: np.ndarray | None = None,
) -> np.ndarray:
    # Objective used here:
    #   (1/(2n)) ||y - Xb||^2 + alpha ||b||_1
    # Assumes y is centered and columns of X are standardized.

    n_samples, n_features = X.shape
    beta = np.zeros(n_features) if warm_start is None else warm_start.copy()

    residuals = y - X @ beta
    feature_norms = np.mean(X ** 2, axis=0)

    for _ in range(n_iter):
        beta_prev = beta.copy()

        for j in range(n_features):
            residuals = residuals + X[:, j] * beta[j]

            rho = np.mean(X[:, j] * residuals)
            beta[j] = soft_threshold(rho, alpha) / feature_norms[j]

            residuals = residuals - X[:, j] * beta[j]

        if np.max(np.abs(beta - beta_prev)) < tol:
            break

    return beta


alphas = np.logspace(-3, 0.8, 45)
lasso_coefs_scaled = []

beta_ws = None
for alpha in alphas:
    beta_ws = lasso_coordinate_descent(X_train_scaled, y_train_centered, alpha=alpha, warm_start=beta_ws)
    lasso_coefs_scaled.append(beta_ws.copy())

lasso_coefs_scaled = np.vstack(lasso_coefs_scaled)

fig = go.Figure()
for j, name in enumerate(feature_names):
    fig.add_trace(go.Scatter(x=alphas, y=lasso_coefs_scaled[:, j], mode="lines", name=name))

fig.update_layout(
    title="Lasso coefficient paths (scaled features)",
    xaxis_title="alpha (log scale)",
    yaxis_title="coefficient (scaled)",
    xaxis_type="log",
)
fig.show()
# Evaluate lasso for a few alpha values

def fit_lasso_and_score(alpha: float) -> dict:
    beta_scaled = lasso_coordinate_descent(X_train_scaled, y_train_centered, alpha=alpha)
    intercept, beta = unscale_coefficients(beta_scaled, feature_mean, feature_std, y_train_mean)
    y_pred = predict_linear(X_test, intercept, beta)
    return {
        "alpha": alpha,
        "mse": mean_squared_error(y_test, y_pred),
        "r2": r2_score(y_test, y_pred),
        "beta": beta,
        "num_nonzero": int(np.sum(np.abs(beta_scaled) > 1e-10)),
    }

[fit_lasso_and_score(a) for a in [0.01, 0.05, 0.1, 0.2]]
[{'alpha': 0.01,
  'mse': 1.2158339344013227,
  'r2': 0.5635115721487047,
  'beta': array([ 2.023 , -1.5535,  0.6674]),
  'num_nonzero': 3},
 {'alpha': 0.05,
  'mse': 1.27909288052184,
  'r2': 0.5408014000121864,
  'beta': array([ 1.0961, -0.494 ,  0.6406]),
  'num_nonzero': 3},
 {'alpha': 0.1,
  'mse': 1.3964369179099974,
  'r2': 0.4986745001551752,
  'beta': array([0.6242, 0.    , 0.5845]),
  'num_nonzero': 2},
 {'alpha': 0.2,
  'mse': 1.5759458587189812,
  'r2': 0.4342301931310043,
  'beta': array([0.4975, 0.    , 0.4452]),
  'num_nonzero': 2}]

5.3 Elastic Net (L1 + L2)#

Elastic Net combines both penalties:

\[ \frac{1}{2n}\|X\beta - y\|_2^2 + \alpha\left(\rho\|\beta\|_1 + \frac{1-\rho}{2}\|\beta\|_2^2\right) \]
  • \(\alpha\) controls overall regularization

  • \(\rho \in [0,1]\) (often called l1_ratio) controls the mix

    • \(\rho=1\) → Lasso

    • \(\rho=0\) → Ridge

Elastic Net is especially useful when features are correlated: it can select groups of features instead of arbitrarily picking one.

def elastic_net_coordinate_descent(
    X: np.ndarray,
    y: np.ndarray,
    alpha: float,
    l1_ratio: float,
    n_iter: int = 5000,
    tol: float = 1e-7,
    warm_start: np.ndarray | None = None,
) -> np.ndarray:
    # Objective used here:
    #   (1/(2n)) ||y - Xb||^2 + alpha * (l1_ratio ||b||_1 + (1-l1_ratio)/2 ||b||_2^2)

    n_samples, n_features = X.shape
    beta = np.zeros(n_features) if warm_start is None else warm_start.copy()

    residuals = y - X @ beta
    feature_norms = np.mean(X ** 2, axis=0)

    l1 = alpha * l1_ratio
    l2 = alpha * (1.0 - l1_ratio)

    for _ in range(n_iter):
        beta_prev = beta.copy()

        for j in range(n_features):
            residuals = residuals + X[:, j] * beta[j]

            rho = np.mean(X[:, j] * residuals)
            beta[j] = soft_threshold(rho, l1) / (feature_norms[j] + l2)

            residuals = residuals - X[:, j] * beta[j]

        if np.max(np.abs(beta - beta_prev)) < tol:
            break

    return beta


alphas = np.logspace(-3, 0.8, 45)
l1_ratio = 0.5

enet_coefs_scaled = []
beta_ws = None
for alpha in alphas:
    beta_ws = elastic_net_coordinate_descent(
        X_train_scaled, y_train_centered, alpha=alpha, l1_ratio=l1_ratio, warm_start=beta_ws
    )
    enet_coefs_scaled.append(beta_ws.copy())

enet_coefs_scaled = np.vstack(enet_coefs_scaled)

fig = go.Figure()
for j, name in enumerate(feature_names):
    fig.add_trace(go.Scatter(x=alphas, y=enet_coefs_scaled[:, j], mode="lines", name=name))

fig.update_layout(
    title=f"Elastic Net coefficient paths (l1_ratio={l1_ratio})",
    xaxis_title="alpha (log scale)",
    yaxis_title="coefficient (scaled)",
    xaxis_type="log",
)
fig.show()

6) scikit-learn equivalents + parameter intuition#

The sklearn versions are production-grade implementations with good defaults.

A few important notes:

  • Most sklearn linear models expect raw features and handle intercept internally (fit_intercept=True by default).

  • For Ridge/Lasso/ElasticNet, feature scaling is critical → use StandardScaler().

  • LinearRegression solves OLS using a stable decomposition (no gradient descent).

Below we compare OLS/Ridge/Lasso/ElasticNet with a consistent pipeline.

models = {
    "OLS": make_pipeline(StandardScaler(), LinearRegression()),
    "Ridge(alpha=1.0)": make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
    "Lasso(alpha=0.1)": make_pipeline(StandardScaler(), Lasso(alpha=0.1, max_iter=50_000)),
    "ElasticNet(alpha=0.1,l1_ratio=0.5)": make_pipeline(
        StandardScaler(), ElasticNet(alpha=0.1, l1_ratio=0.5, max_iter=50_000)
    ),
    "SGDRegressor (GD)": make_pipeline(
        StandardScaler(),
        SGDRegressor(
            loss="squared_error",
            penalty="l2",
            alpha=1e-4,
            learning_rate="invscaling",
            eta0=0.01,
            max_iter=5000,
            tol=1e-6,
            random_state=42,
        ),
    ),
}

rows = []
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    rows.append(
        {
            "model": name,
            "mse": mean_squared_error(y_test, y_pred),
            "r2": r2_score(y_test, y_pred),
        }
    )

rows
[{'model': 'OLS', 'mse': 1.2218555991918068, 'r2': 0.5613497744532487},
 {'model': 'Ridge(alpha=1.0)',
  'mse': 1.2137240283725332,
  'r2': 0.5642690354332558},
 {'model': 'Lasso(alpha=0.1)',
  'mse': 1.3964372563318266,
  'r2': 0.4986743786606125},
 {'model': 'ElasticNet(alpha=0.1,l1_ratio=0.5)',
  'mse': 1.365775969037568,
  'r2': 0.5096818828174675},
 {'model': 'SGDRegressor (GD)',
  'mse': 1.2112856232633256,
  'r2': 0.5651444309806803}]

Parameter cheat sheet (the ones you’ll actually touch)#

LinearRegression#

  • fit_intercept: include intercept term

  • positive: force coefficients to be non-negative

Ridge#

  • alpha: L2 strength (bigger → more shrinkage)

  • solver: numerical method (usually leave as auto)

  • fit_intercept: intercept handling

Lasso#

  • alpha: L1 strength (bigger → more sparsity)

  • max_iter, tol: convergence controls

  • selection: cyclic vs random coordinate updates

ElasticNet#

  • alpha: overall regularization

  • l1_ratio: mix between L1 and L2

  • max_iter, tol: convergence controls

SGDRegressor#

  • loss="squared_error": linear regression loss

  • penalty: "l2", "l1", "elasticnet"

  • alpha: regularization strength (note: different meaning/scale vs Ridge/Lasso)

  • learning_rate, eta0: step size schedule
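One detail the comparison above glosses over: when an estimator sits inside a pipeline, its fitted attributes live on that pipeline step, and `coef_` refers to the scaled features. A minimal sketch on synthetic data (coefficient values here are assumptions of the toy setup):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.5, 0.7]) + rng.normal(0, 0.5, size=100)

model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
model.fit(X, y)

# The estimator is the last pipeline step; coef_ is in *scaled* units
ridge = model[-1]
print(ridge.coef_, ridge.intercept_)
```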

7) Practical checklist + pitfalls#

  • Always split train/test (or use CV) before deciding on alpha.

  • For Ridge/Lasso/ElasticNet:

    • standardize features (StandardScaler)

    • don’t leak test information into scaling

  • Don’t interpret Lasso sparsity as “the truth” without domain checks.

  • If features are strongly correlated:

    • Ridge often stabilizes coefficients

    • Elastic Net can select groups

    • Lasso may pick one arbitrarily
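The scaling and alpha-selection points combine naturally into cross-validation over a pipeline: because the scaler is inside the pipeline, each CV fold fits it on its own training portion only. A sketch on synthetic data (grid and fold count are arbitrary choices):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.5, 0.7]) + rng.normal(0, 1.0, size=200)

# Scaling lives inside the pipeline, so no test/validation leakage during CV
pipe = Pipeline([("scale", StandardScaler()), ("ridge", Ridge())])
grid = GridSearchCV(
    pipe,
    param_grid={"ridge__alpha": np.logspace(-3, 3, 13)},
    scoring="neg_mean_squared_error",
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_)
```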

When to use what (rough intuition)#

  • OLS: small/clean problems, inference-focused, or as a baseline

  • Ridge: lots of correlated features, prediction-focused stability

  • Lasso: you want a sparse model (feature selection)

  • Elastic Net: you want sparsity and correlated-feature friendliness


Exercises#

  1. Increase correlation between x1 and x2 and see how OLS coefficients behave.

  2. Plot test MSE vs alpha for Ridge/Lasso and pick the best.

  3. Add irrelevant noisy features and compare OLS vs Ridge.

References#

  • ESL (Hastie, Tibshirani, Friedman): The Elements of Statistical Learning

  • ISLR (James, Witten, Hastie, Tibshirani): An Introduction to Statistical Learning

  • sklearn docs for Ridge, Lasso, ElasticNet